Abstract

The best currently known interactive debugging 
systems rely upon some meta-information in terms 
of fault probabilities in order to improve their ef- 
ficiency. However, misleading meta information 
might result in a dramatic decrease of the per- 
formance and its assessment is only possible a- 
posteriori. Consequently, as long as the actual 
fault is unknown, there is always some risk of sub- 
optimal interactions. In this work we present a 
reinforcement learning strategy that continuously 
adapts its behavior depending on the performance 
achieved and minimizes the risk of using low- 
quality meta information. Therefore, this method 
is suitable for application scenarios where reliable 
prior fault estimates are difficult to obtain. Us- 
ing diverse real-world knowledge bases, we show 
that the proposed interactive query strategy is scal- 
able, features decent reaction time, and outper- 
forms both entropy-based and no-risk strategies on 
average w.r.t. required amount of user interaction. 

1 Introduction 

Efficient debugging is a prerequisite for the successful evolution,
maintenance and application of knowledge-based systems. In a standard
application scenario a debugger deals with a faulty knowledge base (KB)
$\mathcal{O}$ which fails to meet predefined quality criteria $R$ such as
consistency. Debugging aims at modifying $\mathcal{O}$ such that a
(subset-)minimal set of axioms $\mathcal{D} \subseteq \mathcal{O}$, termed
diagnosis, is deleted in order to restore compliance of the KB with $R$,
whereas a set of axioms $EX_{\mathcal{D}}$ is inserted into $\mathcal{O}$ to
preserve designated entailments which might have been broken by the removal
of $\mathcal{D}$. Usually, a large number of competing diagnoses exist for a
faulty $\mathcal{O}$. Without additional information, there is no means to
decide which $\mathcal{D}$ to prefer. In many practical scenarios, however,
some kind of meta information is available, for example in terms of (1) logs
of prior debugging sessions, (2) common faults or fault patterns occurring in
logical formulas, or (3) a subjective guess of the involved user based on
their experience. Given such data, one can extract a-priori fault
probabilities and use them to guide the search for diagnoses. For example,
one could use a uniform cost strategy to find the most probable diagnosis
w.r.t. fault probabilities, see e.g. [Kalyanpur, 2006].



However, this leads the search to the desired diagnosis, i.e. the one whose
deletion allows a KB compliant with the user's requirements to be formulated,
only in the best case, namely if the fault probabilities are perfectly
adjusted for the particular case.

Interactive debugging systems such as [Shchekotykhin et al., 2012; Siddiqi
and Huang, 2011] tackle this issue by letting an oracle take action during
the debugging session by answering queries. In the case of KBs, a debugger
asks about entailments and non-entailments of the desired $\mathcal{O}_t$,
called test cases [Shchekotykhin et al., 2012]. These pose constraints on the
validity of diagnoses and thus help to sort out incompliant diagnoses and to
update the probabilities of the remaining ones step by step. However, a
debugger can often find many alternative queries for a set of diagnoses.
Selecting the "best" query, i.e. one whose answer provides maximum
information, is very important since it affects the total number of queries
required to localize the fault. In their seminal work, [de Kleer and
Williams, 1987] proposed two query selection strategies: split-in-half and
entropy-based. The latter can profit optimally from properly adjusted initial
fault probabilities, whereas it can fail completely in the case of weak prior
information. Split-in-half shows constant behavior independently of the given
probabilities, but lacks the ability to leverage appropriate fault
information. Selecting the best strategy is problematic, since one has to
judge the quality of the prior fault probabilities without knowing the
desired solution. Our evaluation shows that selecting an inappropriate
strategy can increase the number of queries by more than 2000%.

The contribution of this paper is a new Risk Optimization 
reinforcement learning method (RIO). Compared to existing 
strategies RIO allows to minimize user interaction in the av- 
erage case for any quality of meta information. By virtue of 
its learning capability, our approach is optimally suited for 
debugging of KBs where only vague or no meta information 
is available. Moreover, RIO uses the acquired information 
to adapt its learning strategy. On the one hand, our method 
takes advantage of the given meta information as long as good 
performance is achieved. On the other hand, it gradually gets 
more independent of meta information if suboptimal behavior 
is measured. Experiments on two datasets of faulty ontolo- 
gies show the feasibility, efficiency and scalability of RIO. 
The evaluation indicates that, on average, RIO is the best choice of strategy
for both good and bad meta information, with savings in user interaction of
up to 80%.

Technical preliminaries are provided in Section 2. Section 3 explains the
suggested approach and gives implementation details. Evaluation results are
described in Section 4. Section 5 concludes.

2 Preliminaries 

In order to make the paper self-contained we provide a short 
introduction to description logic (DL), which is a knowledge 
representation and reasoning system (KRS) used in the pa- 
per. Of course, the approach suggested in this work is not 
limited to DL and can be applied to any KRS for which there 
is a sound and complete reasoning method and the entailment 
relation is extensive, monotone and idempotent.

Description logic [Baader et al., 2003] is a family of knowledge
representation languages with a formal logic-based semantics that are
designed to represent knowledge about a domain in the form of concept
descriptions. The syntax of a language $\mathcal{L}$ is defined by its
signature (vocabulary) and a set of constructors. A signature in this case
corresponds to a (disjoint) union of sets $N_C$, $N_R$ and $N_I$, where $N_C$
contains all concept names (unary predicates), $N_R$ comprises all role names
(binary predicates) and $N_I$ is a set of individuals (constants). Each
concept and role description can be either atomic or complex. The latter are
composed using constructors defined in the particular language $\mathcal{L}$.
A typical set of DL constructors includes conjunction $A \sqcap B$,
disjunction $A \sqcup B$, negation $\neg A$, existential restriction
$\exists r.A$ and value restriction $\forall r.A$, where $A, B \in N_C$ and
$r \in N_R$.

A DL ontology $\mathcal{O}$ is defined as a tuple $(\mathcal{T}, \mathcal{A})$,
where $\mathcal{T}$ (TBox) is a set of terminological axioms and
$\mathcal{A}$ (ABox) a set of assertional axioms. Each TBox axiom is
expressed by a general concept inclusion $A \sqsubseteq C$, a form of logical
implication, or by a definition $A \equiv C$, a kind of logical equivalence,
where $C$ is an atomic or complex concept. ABox axioms are used to assert
properties of individuals in terms of the vocabulary defined in the TBox,
e.g. concept assertions $A(x)$ or role assertions $r(x, y)$, where
$x, y \in N_I$.

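The semantics of DLs is given in terms of interpretations
$\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ consisting of a
non-empty domain $\Delta^{\mathcal{I}}$ and a function $\cdot^{\mathcal{I}}$
that maps each concept to a subset of $\Delta^{\mathcal{I}}$, each role to a
subset of $\Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$ and each
individual to some element of $\Delta^{\mathcal{I}}$. An interpretation
$\mathcal{I}$ is a model of $\mathcal{O}$ iff it satisfies all TBox and ABox
axioms. $\mathcal{O}$ is unsatisfiable iff it has no model. A concept $A$
(role $r$) is satisfiable w.r.t. $\mathcal{O}$ iff there is a model
$\mathcal{I}$ of $\mathcal{O}$ with $A^{\mathcal{I}} \neq \emptyset$
($r^{\mathcal{I}} \neq \emptyset$). A TBox is incoherent iff there exists an
unsatisfiable concept or role.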

Usually description logic systems provide sound and com- 
plete reasoning services to their users. In addition to verifi- 
cation of coherence and consistency of O, the reasoners also 
perform classification and realization. Classification is a sub- 
sumption algorithm that determines most specific (general) 
concepts that subsume (are subsumed by) a certain concept. 
Realization computes for each individual $x$ a set of most specific concepts
$\{C_1, \dots, C_n\}$ such that $\mathcal{O} \models C_i(x)$ for all
$i = 1, \dots, n$. Note that when we speak of entailments below, we refer
only to the output computed by the classification and realization services
of a DL reasoner.

Ontology debugging, given an ontology $\mathcal{O}$, aims at approximating
the so-called target ontology $\mathcal{O}_t$ by $\mathcal{O}^*$, where
$\mathcal{O}_t$ is some correct and complete ontology that satisfies all
requirements to the knowledge-based application it is used for.
$\mathcal{O}^*$ must satisfy all explicitly stated requirements and is thus
termed complying ontology. It results from modifications to $\mathcal{O}$ in
terms of (1) deleting axioms $\mathcal{D}$ and (2) inserting axioms
$EX_{\mathcal{D}}$. We call $\mathcal{D} = \mathcal{O} \setminus \mathcal{O}^*$
a diagnosis.

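Definition 1 (Complying Ontology, Diagnosis Problem) Let $\mathcal{O}$ be an
ontology, $\mathcal{B}$ a background KB, $R$ a set of requirements to
$\mathcal{O}$, and $P$ and $N$ a set of positive and negative test cases,
respectively, where each test case $p \in P$ and $n \in N$ is a set of
axioms. Then an ontology $\mathcal{O}^*$ is called complying ontology iff all
of the following conditions hold:

$$\forall r \in R : \mathcal{O}^* \cup \mathcal{B} \text{ fulfills } r \tag{1}$$

$$\forall p \in P : \mathcal{O}^* \cup \mathcal{B} \models p \tag{2}$$

$$\forall n \in N : \mathcal{O}^* \cup \mathcal{B} \not\models n \tag{3}$$

The tuple $\langle \mathcal{O}, \mathcal{B}, P, N \rangle_R$ defines a
diagnosis problem instance (DPI).

Often $R := \{\text{coherence}, \text{consistency}\}$ is assumed.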

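Definition 2 (Diagnosis) $\mathcal{D} \subseteq \mathcal{O}$ is called
diagnosis for a DPI $\langle \mathcal{O}, \mathcal{B}, P, N \rangle_R$ iff
there is a set of axioms $EX_{\mathcal{D}}$ such that
$(\mathcal{O} \setminus \mathcal{D}) \cup EX_{\mathcal{D}}$ is a complying
ontology. A diagnosis $\mathcal{D}$ assumes that all $ax_i \in \mathcal{D}$
are faulty and all $ax_j \in \mathcal{O} \setminus \mathcal{D}$ are correct.
A diagnosis $\mathcal{D}$ is minimal iff there is no
$\mathcal{D}' \subset \mathcal{D}$ s.t. $\mathcal{D}'$ is a diagnosis.
$\mathbf{MD}$ denotes the set of minimal diagnoses of a DPI.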

Note that $\mathbf{MD}$ is usually used to approximate the set of all
diagnoses of a DPI. The identification of $EX_{\mathcal{D}}$, accomplished
e.g. by some learning approach, is a crucial part of the ontology repair
process. However, the complete formulation of $EX_{\mathcal{D}}$ is outside
the scope of this work, where we focus on computing diagnoses. As suggested
in [Shchekotykhin et al., 2012], we approximate $EX_{\mathcal{D}}$ by the set
$\bigcup_{p \in P} p$. Given a DPI
$\langle \mathcal{O}, \mathcal{B}, P, N \rangle_R$, if the set of axioms
$\mathcal{O} \cup \bigcup_{p \in P} p$ is not a complying ontology, then
$\mathcal{D} = \emptyset$ is not a diagnosis, i.e. some axioms in
$\mathcal{O}$ must be modified.

Example 1: Consider
$\mathcal{O} := \mathcal{O}_1 \cup \mathcal{O}_2 \cup M_{12}$ with TBox
$\mathcal{T}$:

    $\mathcal{O}_1$   $ax_1$ : $PhD \sqsubseteq Researcher$
                      $ax_2$ : $Researcher \sqsubseteq DeptEmployee$
    $\mathcal{O}_2$   $ax_3$ : $PhDStudent \sqsubseteq Student$
                      $ax_4$ : $Student \sqsubseteq \neg DeptMember$
    $M_{12}$          $ax_5$ : $PhDStudent \sqsubseteq PhD$
                      $ax_6$ : $DeptEmployee \sqsubseteq DeptMember$

and ABox $\mathcal{A} = \{PhDStudent(s)\}$, where $M_{12}$ is an
automatically generated set of semantic links between $\mathcal{O}_1$ and
$\mathcal{O}_2$. The given ontology $\mathcal{O}$ is inconsistent since it
describes $s$ as both a department member and not a department member. Let
the DPI be defined as
$\langle \mathcal{T}, \mathcal{A}, \emptyset, \emptyset \rangle_{\{coherence\}}$,
where $\mathcal{A}$ is correct and thus added to the background theory and
both sets $P$ and $N$ are empty. For this DPI,
$\mathbf{MD} = \{\mathcal{D}_1 : [ax_1], \mathcal{D}_2 : [ax_2], \mathcal{D}_3 : [ax_3], \mathcal{D}_4 : [ax_4], \mathcal{D}_5 : [ax_5], \mathcal{D}_6 : [ax_6]\}$.
To compute $\mathbf{MD}$ we employ a combination of the HS-Tree [Reiter,
1987] and QuickXPlain [Junker, 2004] algorithms as suggested by [Friedrich
and Shchekotykhin, 2005].
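
The set of minimal diagnoses in Example 1 can be checked with a small brute-force sketch. It is not the HS-Tree/QuickXPlain combination used in the paper: it simply enumerates axiom subsets by increasing size. The encoding is an assumption made for illustration only: every axiom is restricted to the single individual s and written as a Horn rule, so that plain forward chaining is enough to detect the contradiction.

```python
from itertools import combinations

# Example 1 restricted to the individual s. Each axiom is a Horn rule
# (premises, conclusion); conclusion None stands for a contradiction.
AXIOMS = {
    "ax1": ({"PhD"}, "Researcher"),
    "ax2": ({"Researcher"}, "DeptEmployee"),
    "ax3": ({"PhDStudent"}, "Student"),
    # Student is subsumed by not-DeptMember: both atoms together clash.
    "ax4": ({"Student", "DeptMember"}, None),
    "ax5": ({"PhDStudent"}, "PhD"),
    "ax6": ({"DeptEmployee"}, "DeptMember"),
}
ABOX = {"PhDStudent"}  # background knowledge: PhDStudent(s)


def is_consistent(axiom_names, facts):
    """Forward-chain the rules; return False if a contradiction is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for name in axiom_names:
            premises, conclusion = AXIOMS[name]
            if premises <= known:
                if conclusion is None:
                    return False
                if conclusion not in known:
                    known.add(conclusion)
                    changed = True
    return True


def minimal_diagnoses():
    """Enumerate subset-minimal axiom sets whose removal restores consistency."""
    names = sorted(AXIOMS)
    found = []
    for size in range(len(names) + 1):
        for candidate in combinations(names, size):
            removed = set(candidate)
            if any(d <= removed for d in found):
                continue  # contains a smaller diagnosis, hence not minimal
            if is_consistent([n for n in names if n not in removed], ABOX):
                found.append(removed)
    return found


print(minimal_diagnoses())
```

Because all six axioms form a single conflict, removing any one of them is sufficient, which is why the sketch returns exactly the six singleton diagnoses.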

Interactive ontology debugging iteratively incorporates a user's knowledge
about $\mathcal{O}_t$, thereby differentiating between diagnoses in
$\mathbf{MD}$. The overall procedure is as follows:

(1) Compute a set of at most $n$ leading diagnoses
$\mathbf{D} \subseteq \mathbf{MD}$ that serve as an approximation of all
minimal diagnoses $\mathbf{MD}$. Restricting the computation of $\mathbf{MD}$
to a predefined number $n$ helps to overcome the exponential explosion of
HS-Tree. Preference criteria such as most probable or minimum cardinality
diagnoses are used to specify $\mathbf{D}$ within $\mathbf{MD}$. (2) Exploit
$\mathbf{D}$ to compute/select a query which is posed to the user.
(3) Incorporate the user's answer to prune the search space for diagnoses.
Go to (1) until a predefined stop criterion is met by a
$\mathcal{D}^* \in \mathbf{D}$, e.g. $\mathcal{D}^*$ has overwhelming
probability. We call the a-priori unknown diagnosis that will meet the stop
criterion the target diagnosis $\mathcal{D}^*$. As a means of interaction
with the user we utilize the notion of a query, which means asking the user
whether $\mathcal{O}_t \models X_j$, i.e. to classify whether a given set of
axioms $X_j$ should be entailed (assigned to $P$) or not entailed (assigned
to $N$) by $\mathcal{O}_t$. The theoretical foundation for the application of
queries is the fact that $\mathcal{O} \setminus \mathcal{D}_i$ and
$\mathcal{O} \setminus \mathcal{D}_j$ for
$\mathcal{D}_i \neq \mathcal{D}_j \in \mathbf{D}$ entail different sets of
axioms.

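Definition 3 (Query, Partition) Let
$\mathcal{O}^*_i := (\mathcal{O} \setminus \mathcal{D}_i) \cup \mathcal{B} \cup (\bigcup_{p \in P} p)$
where $\mathcal{D}_i \in \mathbf{D}$. A set of axioms $X_j$ is called a query
iff
$\mathbf{D}^P_j := \{\mathcal{D}_i \in \mathbf{D} \mid \mathcal{O}^*_i \models X_j\} \neq \emptyset$
and
$\mathbf{D}^N_j := \{\mathcal{D}_i \in \mathbf{D} \mid \mathcal{O}^*_i \models \neg X_j\} \neq \emptyset$.
The partition of query $X_j$ is denoted by
$\langle \mathbf{D}^P_j, \mathbf{D}^N_j, \mathbf{D}^{\emptyset}_j \rangle$
where
$\mathbf{D}^{\emptyset}_j = \mathbf{D} \setminus (\mathbf{D}^P_j \cup \mathbf{D}^N_j)$.
$\mathbf{X_D}$ terms the set of all queries and associated partitions w.r.t.
$\mathbf{D}$.

Algorithm 1: Query Generation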
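
To make the notion of a partition concrete, the following sketch classifies the six diagnoses of Example 1 for a given query. It is a toy model under stated assumptions: axioms are restricted to the individual s and encoded as Horn rules, entailment is decided by forward chaining, and membership in the negative set is decided by the inconsistency test that Algorithm 1 also uses. The names `closure`, `partition` and `DIAGNOSES` are illustrative and do not come from the paper.

```python
# Example 1 restricted to the individual s, encoded as Horn rules
# (premises, conclusion); conclusion None stands for a contradiction.
AXIOMS = {
    "ax1": ({"PhD"}, "Researcher"),
    "ax2": ({"Researcher"}, "DeptEmployee"),
    "ax3": ({"PhDStudent"}, "Student"),
    "ax4": ({"Student", "DeptMember"}, None),
    "ax5": ({"PhDStudent"}, "PhD"),
    "ax6": ({"DeptEmployee"}, "DeptMember"),
}
ABOX = {"PhDStudent"}
DIAGNOSES = {f"D{i}": {f"ax{i}"} for i in range(1, 7)}


def closure(axiom_names, facts):
    """Return (derived atoms, consistent?) by forward chaining."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for name in axiom_names:
            premises, conclusion = AXIOMS[name]
            if premises <= known:
                if conclusion is None:
                    return known, False
                if conclusion not in known:
                    known.add(conclusion)
                    changed = True
    return known, True


def partition(query, diagnoses):
    """Split diagnoses into (positive, negative, undecided) for a query."""
    d_pos, d_neg, d_zero = [], [], []
    for label, removed in diagnoses.items():
        kept = [name for name in AXIOMS if name not in removed]
        known, _ = closure(kept, ABOX)
        if query <= known:
            d_pos.append(label)      # the repaired KB entails the query
        elif not closure(kept, ABOX | query)[1]:
            d_neg.append(label)      # the query contradicts the repaired KB
        else:
            d_zero.append(label)     # neither entailed nor contradicted
    return d_pos, d_neg, d_zero


print(partition({"DeptEmployee", "Student"}, DIAGNOSES))
print(partition({"PhD"}, DIAGNOSES))
```

For the query {DeptEmployee(s), Student(s)} only the repairs that delete ax4 or ax6 still entail both assertions, while every other repair becomes inconsistent once the query is added.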

The (complete) set $\mathbf{X_D}$ can be generated as shown in Algorithm 1.
In each iteration, given a set of diagnoses
$\mathbf{D}^P_k \subset \mathbf{D}$, common entailments
$X_k := \{e \mid \forall \mathcal{D}_i \in \mathbf{D}^P_k : \mathcal{O}^*_i \models e\}$
are computed (function getEntailments) and used to classify the remaining
diagnoses in $\mathbf{D} \setminus \mathbf{D}^P_k$ to obtain the partition
$\langle \mathbf{D}^P_k, \mathbf{D}^N_k, \mathbf{D}^{\emptyset}_k \rangle$
associated with $X_k$. Then, together with its partition, $X_k$ is added to
$\mathbf{X_D}$. The function inconsist(arg) returns true if arg is
inconsistent or incoherent.

Let the answering of queries by a user be modeled as a function
$u : \mathbf{X_D} \to \{t, f\}$. If $u_j := u(X_j) = t$, then
$P \leftarrow P \cup \{X_j\}$ and
$\mathbf{D} \leftarrow \mathbf{D} \setminus \mathbf{D}^N_j$. Otherwise,
$N \leftarrow N \cup \{X_j\}$ and
$\mathbf{D} \leftarrow \mathbf{D} \setminus \mathbf{D}^P_j$. Prospectively,
according to Definition 2, only those diagnoses are considered in the set
$\mathbf{D}$ that comply with the new DPI obtained by the addition of a test
case. This allows us to formalize the problem we address in this work:
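
The update rule for a query answer can be sketched in a few lines. This is an illustration only: diagnoses are represented by labels, the partition is passed in as three lists, and the function name `apply_answer` is hypothetical.

```python
def apply_answer(leading, query, partition, answer, positives, negatives):
    """Record the answered query and drop the diagnoses it rules out.

    answer=True  (t): query joins P, diagnoses in the negative set are removed.
    answer=False (f): query joins N, diagnoses in the positive set are removed.
    """
    d_pos, d_neg, _d_zero = partition
    if answer:
        positives = positives + [query]
        eliminated = set(d_neg)
    else:
        negatives = negatives + [query]
        eliminated = set(d_pos)
    remaining = [d for d in leading if d not in eliminated]
    return remaining, positives, negatives


leading = ["D1", "D2", "D3", "D4", "D5", "D6"]
part = (["D4", "D6"], ["D1", "D2", "D3", "D5"], [])
print(apply_answer(leading, "X1", part, True, [], []))
```

Diagnoses in the undecided set survive either answer, which is why queries with a large undecided set discriminate poorly.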

Problem Definition (Diagnosis Discrimination) Given $\mathbf{D}$ w.r.t.
$\langle \mathcal{O}, \mathcal{B}, P, N \rangle_R$, a stop criterion
$stop : \mathbf{D} \to \{t, f\}$ and a user $u$, find a next query
$X_j \in \mathbf{X_D}$ such that (1) $(X_j, \dots, X_q)$ is a sequence of
minimal length and (2) after $X \in \{X_j, \dots, X_q\}$ are added to $P$ and
$N$ according to $\{u_j, \dots, u_q\}$, there exists a
$\mathcal{D}^* \in \mathbf{D}$ such that $stop(\mathcal{D}^*) = t$.

Input: DPI $\langle \mathcal{O}, \mathcal{B}, P, N \rangle_R$, set of corresponding diagnoses $\mathbf{D}$
Output: a set of queries and associated partitions $\mathbf{X_D}$

    foreach $\mathbf{D}^P_k \subset \mathbf{D}$ do
        $X_k \leftarrow$ getEntailments($\mathcal{O}, \mathcal{B}, P, \mathbf{D}^P_k$);
        if $X_k \neq \emptyset$ then
            foreach $\mathcal{D}_r \in \mathbf{D} \setminus \mathbf{D}^P_k$ do
                if $\mathcal{O}^*_r \models X_k$ then $\mathbf{D}^P_k \leftarrow \mathbf{D}^P_k \cup \{\mathcal{D}_r\}$;
                else if inconsist($\mathcal{O}^*_r \cup X_k$) then $\mathbf{D}^N_k \leftarrow \mathbf{D}^N_k \cup \{\mathcal{D}_r\}$;
                else $\mathbf{D}^{\emptyset}_k \leftarrow \mathbf{D}^{\emptyset}_k \cup \{\mathcal{D}_r\}$;
            $\mathbf{X_D} \leftarrow \mathbf{X_D} \cup \langle X_k, \langle \mathbf{D}^P_k, \mathbf{D}^N_k, \mathbf{D}^{\emptyset}_k \rangle \rangle$;
    return $\mathbf{X_D}$;

Two strategies for selecting the "best" next query have been proposed [de
Kleer and Williams, 1987] and adapted to debugging of KBs by [Shchekotykhin
et al., 2012]. The split-in-half strategy (SPL) selects the query
$X_j \in \mathbf{X_D}$ which minimizes the scoring function

$$sc_{split}(X_j) := \big|\, |\mathbf{D}^P_j| - |\mathbf{D}^N_j| \,\big| + |\mathbf{D}^{\emptyset}_j|.$$

So, SPL prefers queries which eliminate half of the diagnoses independently
of the query outcome. The entropy-based strategy (ENT) uses information about
prior probabilities $p_t$ for the user to make a mistake when using a
syntactical construct of type $t \in CT(\mathcal{L})$, where
$CT(\mathcal{L})$ is the set of constructors available in the used knowledge
representation language $\mathcal{L}$, e.g.
$\{\forall, \exists, \sqsubseteq, \neg, \sqcup, \sqcap\} \subset CT(\text{OWL})$
[Grau et al., 2008]. These fault probabilities $p_t$ are assumed to be
independent and are used to calculate fault probabilities of axioms $ax_k$ as

$$p(ax_k) = 1 - \prod_{t \in CT(\mathcal{L})} (1 - p_t)^{n(t)}$$

where $n(t)$ is the number of occurrences of construct type $t$ in $ax_k$.
The probabilities of axioms can in turn be used to determine fault
probabilities of diagnoses $\mathcal{D}_i \in \mathbf{D}$ as

$$p(\mathcal{D}_i) = \prod_{ax_r \in \mathcal{D}_i} p(ax_r) \prod_{ax_s \in \mathcal{O} \setminus \mathcal{D}_i} (1 - p(ax_s)). \tag{4}$$

ENT selects the query $X_j \in \mathbf{X_D}$ with highest expected
information gain, i.e. the one which minimizes $sc_{ent}(X_j)$ defined as

$$sc_{ent}(X_j) := \sum_{a \in \{t, f\}} p(u_j = a) \sum_{\mathcal{D}_k \in \mathbf{D}} -p(\mathcal{D}_k \mid u_j = a) \log_2 p(\mathcal{D}_k \mid u_j = a)$$

where
$p(u_j = t) = \sum_{\mathcal{D}_r \in \mathbf{D}^P_j} p(\mathcal{D}_r) + \frac{1}{2} p(\mathbf{D}^{\emptyset}_j)$,
$p(\mathbf{D}^{\emptyset}_j) = \sum_{\mathcal{D}_r \in \mathbf{D}^{\emptyset}_j} p(\mathcal{D}_r)$
and $p(u_j = f) = 1 - p(u_j = t)$. The answer $u_j = a$ is used to update the
probabilities $p(\mathcal{D}_k)$ according to the Bayesian formula, yielding
$p(\mathcal{D}_k \mid u_j = a)$. The result of the evaluation in
[Shchekotykhin et al., 2012] shows that ENT performs better than SPL in most
cases. However, SPL proved to be the best strategy in situations where
misleading prior information is provided, i.e. where the target diagnosis
$\mathcal{D}^*$ has low probability. So, one can regard ENT as a high-risk
strategy with high potential to perform well, depending on the a-priori
unknown quality of the given fault information. SPL, in contrast, can be seen
as a no-risk strategy without any potential to leverage good meta
information. Therefore, selection of the proper combination of prior
probabilities $\{p_t \mid t \in CT(\mathcal{L})\}$ and query selection
strategy is crucial for successful diagnosis discrimination and minimization
of user interaction.
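
The probability model and the two scoring functions can be sketched as follows. The sketch makes two assumptions that are not spelled out above: the diagnosis probabilities passed to `entropy_score` are already normalized over the leading diagnoses, and the Bayesian update uses likelihoods 1, 0 and 1/2 for diagnoses in the positive, negative and undecided set respectively, which is the choice consistent with the stated formula for p(u_j = t). All function names are illustrative.

```python
import math


def axiom_fault_probability(construct_counts, construct_probs):
    """p(ax) = 1 - prod_t (1 - p_t)^n(t) over the constructs used in ax."""
    no_fault = 1.0
    for construct, count in construct_counts.items():
        no_fault *= (1.0 - construct_probs[construct]) ** count
    return 1.0 - no_fault


def diagnosis_probability(diagnosis, axiom_probs):
    """Formula (4): axioms in the diagnosis are faulty, all others correct."""
    p = 1.0
    for axiom, p_ax in axiom_probs.items():
        p *= p_ax if axiom in diagnosis else 1.0 - p_ax
    return p


def split_score(d_pos, d_neg, d_zero):
    """SPL score: imbalance of the partition plus undecided diagnoses."""
    return abs(len(d_pos) - len(d_neg)) + len(d_zero)


def entropy_score(d_pos, d_neg, d_zero, probs):
    """ENT score: expected entropy of the diagnosis distribution after the answer."""
    p_true = sum(probs[d] for d in d_pos) + 0.5 * sum(probs[d] for d in d_zero)
    score = 0.0
    # First pass models answer t (positive set agrees), second pass answer f.
    for p_answer, agreeing in ((p_true, d_pos), (1.0 - p_true, d_neg)):
        if p_answer == 0.0:
            continue
        for d, prior in probs.items():
            if d in agreeing:
                likelihood = 1.0
            elif d in d_zero:
                likelihood = 0.5
            else:
                likelihood = 0.0
            posterior = likelihood * prior / p_answer
            if posterior > 0.0:
                score -= p_answer * posterior * math.log2(posterior)
    return score


uniform = {f"D{i}": 1 / 6 for i in range(1, 7)}
print(entropy_score(["D1", "D2", "D3"], ["D4", "D5", "D6"], [], uniform))
print(entropy_score(["D1", "D2", "D3", "D4", "D5"], ["D6"], [], uniform))
```

With uniform probabilities the even split obtains the lower (better) score, so ENT and SPL agree; the two strategies diverge only once the fault probabilities are skewed.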

3 Risk Optimization for Query Selection 

The proposed Risk Optimization Algorithm (RIO) extends 
ENT strategy with a dynamic learning procedure that learns 
by reinforcement how to select optimal queries. The behav- 
ior is determined by the achieved performance in terms of 
diagnosis elimination rate. Good performance means similar 
behavior to ENT, whereas aggravation of performance leads 
to a gradual neglect of the given meta information. Like ENT, 
RIO continually improves the prior fault probabilities based 
on new knowledge obtained through queries to a user. 



RIO learns a "cautiousness" parameter $c$ whose admissible values are
captured by the user-defined interval $[\underline{c}, \overline{c}]$. The
relationship between $c$ and queries is as follows:

Definition 4 (Cautiousness of a Query) We define the cautiousness $c_q(X_i)$
of a query $X_i$ as follows:

$$c_q(X_i) := \frac{\min\{|\mathbf{D}^P_i|, |\mathbf{D}^N_i|\}}{|\mathbf{D}|} \in \left[0, \frac{\lfloor |\mathbf{D}|/2 \rfloor}{|\mathbf{D}|}\right] =: [\underline{c_q}, \overline{c_q}]$$

A query $X_i$ is called braver than query $X_j$ iff $c_q(X_i) < c_q(X_j)$.
Otherwise $X_i$ is called more cautious than $X_j$. A query with maximum
cautiousness is called no-risk query.

Definition 5 (Elimination Rate) Given a query $X_i$ and the corresponding
answer $u_i \in \{t, f\}$, the elimination rate is
$e(X_i, u_i) = |\mathbf{D}^N_i| / |\mathbf{D}|$ if $u_i = t$ and
$e(X_i, u_i) = |\mathbf{D}^P_i| / |\mathbf{D}|$ if $u_i = f$. The answer
$u_i$ to a query $X_i$ is called favorable iff it maximizes the elimination
rate $e(X_i, u_i)$. Otherwise $u_i$ is called unfavorable. The minimal or
worst case elimination rate $\min_{u_i \in \{t, f\}} e(X_i, u_i)$ of $X_i$ is
denoted by $e_{wc}(X_i)$.

So, the cautiousness $c_q(X_i)$ of a query $X_i$ is exactly the worst case
elimination rate, i.e. $c_q(X_i) = e_{wc}(X_i) = e(X_i, u_i)$ given that
$u_i$ is the unfavorable query result. Intuitively, parameter $c$
characterizes the minimum proportion of diagnoses in $\mathbf{D}$ which
should be eliminated by the successive query.

Definition 6 (High-Risk Query) Given a query $X_i$ and cautiousness $c$,
$X_i$ is called a high-risk query iff $c_q(X_i) < c$, i.e. the cautiousness
of the query is lower than the algorithm's current cautiousness value $c$.
Otherwise, $X_i$ is called non-high-risk query. By
$NHR_c(\mathbf{X_D}) \subseteq \mathbf{X_D}$ we denote the set of all
non-high-risk queries w.r.t. $c$. For given cautiousness $c$, the set of all
queries $\mathbf{X_D}$ can be partitioned into high-risk queries and
non-high-risk queries.
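
Definitions 4 to 6 translate directly into a few helper functions. The sketch below works on the sizes of the positive and negative sets only; the function names are illustrative, and the upper bound on the cautiousness is taken to be the floor of |D|/2 divided by |D|.

```python
def cautiousness(n_pos, n_neg, n_total):
    """c_q: share of diagnoses eliminated by the unfavorable answer."""
    return min(n_pos, n_neg) / n_total


def elimination_rate(n_pos, n_neg, n_total, answer):
    """Answer t removes the negative set, answer f removes the positive set."""
    return (n_neg if answer else n_pos) / n_total


def max_cautiousness(n_total):
    """Upper bound of c_q, reached by a no-risk query."""
    return (n_total // 2) / n_total


def is_high_risk(n_pos, n_neg, n_total, c):
    """A query is high-risk iff its cautiousness is below the current c."""
    return cautiousness(n_pos, n_neg, n_total) < c


# A partition of 6 diagnoses into 2 positive and 4 negative ones, with c = 0.3.
print(cautiousness(2, 4, 6), is_high_risk(2, 4, 6, 0.3))
```

For the partition in the usage line, the worst case eliminates 2 of 6 diagnoses, so the query is not high-risk for c = 0.3, whereas a 5/1 split would be.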

Example 2 (cont. Example 1): Let the user specify $c := 0.3$ for the set
$\mathbf{D}$ with $|\mathbf{D}| = 6$. Given these settings,
$X_1 := \{DeptEmployee(s), Student(s)\}$ is a non-high-risk query since its
partition
$\langle \mathbf{D}^P_1, \mathbf{D}^N_1, \mathbf{D}^{\emptyset}_1 \rangle = \langle \{\mathcal{D}_4, \mathcal{D}_6\}, \{\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3, \mathcal{D}_5\}, \emptyset \rangle$
and thus its cautiousness $c_q(X_1) = 2/6 \geq 0.3 = c$. The query
$X_2 := \{PhD(s)\}$ with partition
$\langle \{\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3, \mathcal{D}_4, \mathcal{D}_6\}, \{\mathcal{D}_5\}, \emptyset \rangle$
is a high-risk query because $c_q(X_2) = 1/6 < 0.3 = c$, and
$X_3 := \{Researcher(s), Student(s)\}$ with
$\langle \{\mathcal{D}_2, \mathcal{D}_4, \mathcal{D}_6\}, \{\mathcal{D}_1, \mathcal{D}_3, \mathcal{D}_5\}, \emptyset \rangle$
is a no-risk query due to $c_q(X_3) = 3/6 = \overline{c_q}$.

Given a user's answer $u_s$ to a query $X_s$, the cautiousness $c$ is updated
depending on the elimination rate $e(X_s, u_s)$ by
$c \leftarrow c + c_{adj}$ where the cautiousness adjustment factor is
$c_{adj} := 2(\overline{c} - \underline{c})\, adj$. The scaling factor
$2(\overline{c} - \underline{c})$ regulates the extent of the cautiousness
adjustment depending on the interval length $\overline{c} - \underline{c}$.
More crucial is the factor $adj$ that indicates the sign and magnitude of the
cautiousness adjustment:

$$adj := \frac{\lfloor |\mathbf{D}|/2 - \varepsilon \rfloor}{|\mathbf{D}|} - e(X_s, u_s)$$

where $\varepsilon \in (0, \frac{1}{2})$ is a constant which prevents the
algorithm from getting stuck in a no-risk strategy for even $|\mathbf{D}|$.
E.g., given $c = 0.5$ and $\varepsilon = 0$, the elimination rate of a
no-risk query is $e(X_s, u_s) = \frac{1}{2}$, resulting always in $adj = 0$.
The value of $\varepsilon$ can be set to an arbitrary real number in this
interval, e.g. $\varepsilon := \frac{1}{4}$. If $c + c_{adj}$ is outside the
user-defined cautiousness interval $[\underline{c}, \overline{c}]$, it is set
to $\underline{c}$ if $c < \underline{c}$ and to $\overline{c}$ if
$c > \overline{c}$. Positive $c_{adj}$ is a penalty telling the algorithm to
get more cautious, whereas negative $c_{adj}$ is a bonus resulting in a
braver behavior of the algorithm. Note that
$[\underline{c}, \overline{c}] \subseteq [\underline{c_q}, \overline{c_q}]$
must hold for the user-defined interval. $\underline{c} - \underline{c_q}$
and $\overline{c_q} - \overline{c}$ represent the minimal desired difference
in performance to a high-risk (ENT) and no-risk (SPL) query selection,
respectively. By expressing trust (disbelief) in the prior fault
probabilities through specification of lower (higher) values for
$\underline{c}$ and/or $\overline{c}$, the user can influence the behavior of
RIO.

Example 3 (cont. Example 1): Assume $p(ax_i) := 0.001$ for $ax_i$
$(i = 1, \dots, 4)$ and $p(ax_5) := 0.1$, $p(ax_6) := 0.15$, and the user
rather disbelieves these fault probabilities and thus sets $c = 0.4$,
$\underline{c} = 0$ and $\overline{c} = 0.5$. In this case RIO selects a
no-risk query $X_3$ just as SPL. Given $u_3 = t$ and $|\mathbf{D}| = 6$, the
algorithm computes the elimination rate $e(X_3, t) = 0.5$ and adjusts the
cautiousness by $c_{adj} = -0.17$, which yields $c = 0.23$. This allows RIO
to select a higher-risk query in the next iteration, whereupon the target
diagnosis $\mathcal{D}^* = \mathcal{D}_2$ is found after asking three
queries. In the same situation, ENT (starting with high-risk query $X_1$)
would require four queries.
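
The arithmetic of Example 3 can be reproduced with a short sketch of the cautiousness update. It assumes that the adjustment factor is the floor of (|D|/2 minus epsilon) divided by |D|, minus the observed elimination rate, which is the reading that matches the numbers of the example; the default `eps=0.25` and the function name are illustrative choices.

```python
import math


def update_cautiousness(c, c_low, c_high, n_diagnoses, elim_rate, eps=0.25):
    """Return the new cautiousness after observing an elimination rate.

    A rate above the reference value makes the strategy braver (c decreases),
    a rate below it makes the strategy more cautious (c increases).
    """
    adj = math.floor(n_diagnoses / 2 - eps) / n_diagnoses - elim_rate
    c_new = c + 2 * (c_high - c_low) * adj
    # Clamp to the user-defined interval [c_low, c_high].
    return min(max(c_new, c_low), c_high)


# Example 3: c = 0.4, interval [0, 0.5], |D| = 6, elimination rate 0.5.
print(round(update_cautiousness(0.4, 0.0, 0.5, 6, 0.5), 2))
```

Here the reference value is 2/6, the adjustment is 2/6 - 1/2 = -1/6, and the scaling factor 2(0.5 - 0) equals 1, so c drops from 0.4 to about 0.23.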

RIO, described in Algorithm 2, starts with the computation of minimal
diagnoses. The getDiagnoses function implements a combination of the HS-Tree
and QuickXPlain algorithms. Using uniform-cost search, the algorithm extends
the set of leading diagnoses $\mathbf{D}$ with a maximum number of most
probable minimal diagnoses such that $|\mathbf{D}| \leq n$.

Then the getProbabilities function calculates the fault probabilities
$p(\mathcal{D}_i)$ for each diagnosis $\mathcal{D}_i$ of the set of leading
diagnoses $\mathbf{D}$ using formula (4). Next it adjusts the probabilities
as per Bayes' theorem, taking into account all previous query answers which
are stored in $P$ and $N$. Finally, the resulting probabilities
$p_{adj}(\mathcal{D}_i)$ are normalized. Based on the set of leading
diagnoses $\mathbf{D}$, generateQueries generates queries according to
Algorithm 1. getMinScoreQuery determines the best query
$X_{sc} \in \mathbf{X_D}$ according to $sc_{ent}$:
$X_{sc} = \arg\min_{X_k \in \mathbf{X_D}} (sc_{ent}(X_k))$. If $X_{sc}$ is a
non-high-risk query, i.e. $c \leq c_q(X_{sc})$ (determined by
getQueryCautiousness), $X_{sc}$ is selected. In this case, $X_{sc}$ is the
query with best information gain in $\mathbf{X_D}$ and moreover guarantees
the required elimination rate specified by $c$.



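Algorithm 2: Risk Optimization Algorithm (RIO)

Input: diagnosis problem instance $\langle \mathcal{O}, \mathcal{B}, P, N \rangle_R$, fault probabilities of
    diagnoses $DP$, cautiousness $C = (c, \underline{c}, \overline{c})$, number of leading diagnoses
    $n$ to be considered, acceptance threshold $\sigma$
Output: a diagnosis $\mathcal{D}$

    $P \leftarrow \emptyset$; $N \leftarrow \emptyset$; $\mathbf{D} \leftarrow \emptyset$;
    repeat
        $\mathbf{D} \leftarrow$ getDiagnoses($\mathbf{D}, n, \mathcal{O}, \mathcal{B}, P, N$);
        $DP \leftarrow$ getProbabilities($DP, \mathbf{D}, P, N$);
        $\mathbf{X} \leftarrow$ generateQueries($\mathcal{O}, \mathcal{B}, P, \mathbf{D}$);
        $X_s \leftarrow$ getMinScoreQuery($DP, \mathbf{X}$);
        if getQueryCautiousness($X_s, \mathbf{D}$) $< c$ then
            $X_s \leftarrow$ getAlternativeQuery($c, \mathbf{X}, DP, \mathbf{D}$);
        if getAnswer($X_s$) = yes then $P \leftarrow P \cup \{X_s\}$;
        else $N \leftarrow N \cup \{X_s\}$;
        $c \leftarrow$ updateCautiousness($\mathbf{D}, P, N, X_s, c, \underline{c}, \overline{c}$);
    until (aboveThreshold($DP, \sigma$) $\lor$ eliminationRate($X_s$) = 0);
    return mostProbableDiag($\mathbf{D}, DP$);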



Otherwise, getAlternativeQuery selects the query
$X_{alt} \in \mathbf{X_D}$ ($X_{alt} \neq X_{sc}$) which has minimal score
$sc_{ent}$ among all least cautious non-high-risk queries $L_c$. That is,
$X_{alt} = \arg\min_{X_k \in L_c} (sc_{ent}(X_k))$ where
$L_c := \{X_r \in NHR_c(\mathbf{X_D}) \mid \forall X_t \in NHR_c(\mathbf{X_D}) : c_q(X_r) \leq c_q(X_t)\}$.
If there is no such query $X_{alt} \in \mathbf{X_D}$, then $X_{sc}$ is
selected.

Given the user's answer $u_s$, the selected query
$X_s \in \{X_{sc}, X_{alt}\}$ is added to $P$ or $N$ accordingly. In the last
step of the main loop the algorithm updates the cautiousness value $c$
(function updateCautiousness) as described above.

Before the next query selection iteration starts, a stop condition test is
performed. The algorithm evaluates whether the most probable diagnosis is at
least $\sigma$% more likely than the second most probable diagnosis
(aboveThreshold) or none of the leading diagnoses has been eliminated by the
previous query, i.e. getEliminationRate returns zero for $X_s$. If a stop
condition is met, the presently most likely diagnosis is returned
(mostProbableDiag).
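
The query selection step of RIO (the combination of getMinScoreQuery and getAlternativeQuery) can be sketched as follows. Queries are represented as hypothetical `(name, score, cautiousness)` tuples; the scores in the usage example are made-up placeholder numbers, not values computed from Example 1.

```python
def select_query(queries, c):
    """Pick the next query for cautiousness c.

    queries: list of (name, sc_ent score, cautiousness c_q) tuples.
    The entropy-optimal query is used unless it is high-risk; in that case
    the best-scoring query among the least cautious non-high-risk ones is
    used, falling back to the entropy-optimal query if none exists.
    """
    best = min(queries, key=lambda q: q[1])
    if best[2] >= c:
        return best
    non_high_risk = [q for q in queries if q[2] >= c]
    if not non_high_risk:
        return best
    least = min(q[2] for q in non_high_risk)
    least_cautious = [q for q in non_high_risk if q[2] == least]
    return min(least_cautious, key=lambda q: q[1])


# Placeholder scores; cautiousness values follow the 2/4, 5/1 and 3/3 splits.
candidates = [("X1", 0.9, 2 / 6), ("X2", 0.5, 1 / 6), ("X3", 1.2, 3 / 6)]
print(select_query(candidates, 0.3)[0])
print(select_query(candidates, 0.1)[0])
```

Choosing the least cautious admissible query keeps RIO as close to ENT as the current value of c permits, instead of jumping straight to a split-in-half query.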

4 Evaluation 

Goals. This evaluation should demonstrate that (1) there is 
a significant discrepancy between SPL and ENT concerning 
number of queries where the winner depends on the quality of 
meta information, (2) RIO exhibits superior average behavior 
compared to ENT and SPL w.r.t. the amount of user inter- 
action required, irrespective of the quality of specified fault 
information, (3) RIO scales well and (4) its reaction time is 
well suited for an interactive debugging approach. 
Provenance of test data. As data source for the evaluation 
we used faulty real-world ontologies produced by automatic 
ontology matching systems (OMSs) (cf. Example 1). 

Definition 7 (Ontology matching) [Shvaiko and Euzenat, 2012] Let
$Q(\mathcal{O}) \subseteq S(\mathcal{O})$ denote the set of matchable
elements in an ontology $\mathcal{O}$, where $S(\mathcal{O})$ denotes the
signature of $\mathcal{O}$. An ontology matching operation determines an
alignment $M_{ij}$, which is a set of correspondences between matched
ontologies $\mathcal{O}_i$ and $\mathcal{O}_j$. Each correspondence is a
4-tuple $\langle x_i, x_j, r, v \rangle$ such that
$x_i \in Q(\mathcal{O}_i)$, $x_j \in Q(\mathcal{O}_j)$, $r$ is a semantic
relation and $v \in [0, 1]$ is a confidence value. We call
$\mathcal{O}_{iMj} := \mathcal{O}_i \cup \alpha(M_{ij}) \cup \mathcal{O}_j$
the aligned ontology for $\mathcal{O}_i$ and $\mathcal{O}_j$, where $\alpha$
maps each correspondence to an axiom.

In the following, let $Q(\mathcal{O})$ be the restriction to atomic concepts
and roles in $S(\mathcal{O})$, $r \in \{\sqsubseteq, \sqsupseteq, \equiv\}$
and $\alpha$ the natural alignment semantics [Meilicke and Stuckenschmidt,
2009] that maps correspondences one-to-one to axioms of the form
$x_i \; r \; x_j$. We evaluate RIO using aligned ontologies for the following
reasons: (1) Alignments often cause inconsistency/incoherence of ontologies.
(2) The (fault) structure of different ontologies obtained through matching
generally varies due to the different authors and matching systems involved.
(3) For the same reasons, it is hard to estimate the quality of fault
probabilities, i.e. it is unclear which existing query selection strategy to
choose for best performance. (4) Correct reference alignments are available.
Test datasets. We used two datasets D1 and D2: Each faulty aligned ontology
$\mathcal{O}_{iMj}$ in D1 is the result of applying one of four OMSs to a set
of six independently created ontologies in the domain of conference
organization. For a given pair of ontologies
$\mathcal{O}_i \neq \mathcal{O}_j$, each system produced an alignment
$M_{ij}$. The average size of $\mathcal{O}_{iMj}$ per matching system was
between 312 and 377 axioms. D1 is a superset of the dataset used in
[Stuckenschmidt, 2008] for which all debugging systems under evaluation
manifested correctness or scalability problems. D2, used to assess the
scalability of RIO, is the set of ontologies from the ANATOMY track in the
Ontology Alignment Evaluation Initiative¹ (OAEI) 2011.5 [Shvaiko and Euzenat,
2012], which comprises two input ontologies $\mathcal{O}_1$ (11545 axioms)
and $\mathcal{O}_2$ (4838 axioms). The size of the aligned ontologies
generated from the results of seven different OMSs was between 17530 and
17844 axioms.

Reference Solutions. For dataset D1, based on a manually produced reference
alignment $\mathcal{R}_{ij} \subseteq M_{ij}$ for ontologies
$\mathcal{O}_i, \mathcal{O}_j$ (cf. [Meilicke et al., 2008]), we were able to
fix a target diagnosis
$\mathcal{D}^* := \alpha(M_{ij} \setminus \mathcal{R}_{ij})$ for each
incoherent $\mathcal{O}_{iMj}$. In cases where $\mathcal{D}^*$ represented a
non-minimal diagnosis, it was randomly redefined as a minimal diagnosis
$\mathcal{D}^* \subset \alpha(M_{ij} \setminus \mathcal{R}_{ij})$. In the
case of D2, given the ontologies $\mathcal{O}_1$ and $\mathcal{O}_2$, the
matching output $M_{12}$, and the correct reference alignment
$\mathcal{R}_{12}$, we fixed $\mathcal{D}^*$ as follows: We carried out
(prior to the actual experiment) a debugging session with DPI
$\langle \alpha(M_{12} \setminus \mathcal{R}_{12}), \mathcal{O}_1 \cup \mathcal{O}_2 \cup \alpha(M_{12} \cap \mathcal{R}_{12}), \emptyset, \emptyset \rangle_{\{coherence\}}$
and randomly chose one of the identified diagnoses as $\mathcal{D}^*$. Note
that it is common for OMSs [Meilicke, 2011] that $\mathcal{D}^*$ is a proper
subset of $\mathcal{D} := \alpha(M_{ij} \setminus \mathcal{R}_{ij})$, as
there is no evidence based on coherence to classify any
$ax \in \mathcal{D} \setminus \mathcal{D}^*$ as faulty.
Test settings². We conducted four experiments EXP-$i$ ($i = 1, \dots, 4$),
the first two with dataset D1 and the other two with D2. In experiments 1 and
3 we simulated good fault probabilities by setting $p(ax_k) := 0.001$ for
$ax_k \in \mathcal{O}_i \cup \mathcal{O}_j$ and $p(ax_m) := 1 - v_m$ for
$ax_m \in M_{ij}$, where $v_m$ is the confidence of the correspondence
underlying $ax_m$. Low-quality fault information was used in experiments 2
and 4. In EXP-4 the following probabilities were defined: $p(ax_k) := 0.01$
for $ax_k \in \mathcal{O}_i \cup \mathcal{O}_j$ and $p(ax_m) := 0.001$ for
$ax_m \in M_{ij}$. In EXP-2 we used the probability settings of EXP-1, but
fixed a completely unlikely target diagnosis: we precomputed (prior to the
actual experiment) the 30 most probable minimal diagnoses, and from these
selected the one including the highest number of axioms
$ax_k \in \mathcal{O}_{iMj} \setminus \alpha(M_{ij})$ as $\mathcal{D}^*$.

In all experiments, we set $|\mathbf{D}| := 9$, which proved to be a good
trade-off between computation effort and representativeness of the leading
diagnoses, $\sigma := 85\%$ and, as input parameters for RIO, $c := 0.25$ and
$[\underline{c}, \overline{c}] := [\underline{c_q}, \overline{c_q}] = [0, \frac{4}{9}]$.
To let the tests pose the highest challenge for the evaluated methods, the
initial DPI was specified as
$\langle \mathcal{O}_{iMj}, \emptyset, \emptyset, \emptyset \rangle_{\{coherence\}}$,
i.e. the full search space was explored without adding parts of
$\mathcal{O}_{iMj}$ to $\mathcal{B}$. In practice, given prior knowledge of
correct axioms, adding those to $\mathcal{B}$ can severely restrict the
search space and greatly accelerate debugging. All tests were executed on a
Core-i7 (3930K), 32GB RAM with Ubuntu 11.04 and Java 6.
Metrics. Each experiment involved a debugging session of ENT, SPL as well as
RIO for each ontology in the respective dataset. In each session we measured
the number of required queries (q) until $\mathcal{D}^*$ was identified, the
overall debugging time (debug) assuming that queries are answered
instantaneously, and the reaction time (react), i.e. the average time between
two successive queries. The queries generated in the tests were answered by
an automatic oracle by means of the target ontology
$\mathcal{O}_t := \mathcal{O}_{iMj} \setminus \mathcal{D}^*$.

¹ http://oaei.ontologymatching.org

² See http://code.google.com/p/rmbd/wiki/ for code and details.

Figure 1: (a) Percentage rates indicating in how many debugging sessions which
strategy performed best/better w.r.t. the required user interaction, i.e.
number of queries. EXP-1 and EXP-2 involved 27, EXP-3 and EXP-4 seven
debugging sessions each. $q_{str}$ denotes the number of queries needed by
strategy $str$ and min is an abbreviation for $\min(q_{SPL}, q_{ENT})$.
(b) Average time (sec) for the entire debugging session (debug), average time
(sec) between two successive queries (react), and average number of queries
(q) required by each strategy.

Figure 2: (a), (b) The bars show the average number of queries (q) needed by
RIO, grouped by matching tools. The lower (upper) end of the whisker
indicates the average q needed by the per-session better (worse) strategy in
{SPL, ENT}. (c) Box-whisker plots presenting the distribution of the overhead
$(q_w - q_b)/q_b \cdot 100$ (in %) per debugging session of the worse
strategy $q_w := \max(q_{SPL}, q_{ENT})$ compared to the better strategy
$q_b := \min(q_{SPL}, q_{ENT})$. Mean values are depicted by a cross.

Observations. The difference w.r.t. number of queries 
per test run between the better and the worse strategy in 
{SPL,ENT} was absolutely significant, with a maximum of 
2300% in EXP-4 and averages of 190% to 1145% throughout all experiments
(Figure 2(c)). Moreover, the results show that the varying quality of fault
probabilities in {EXP-1, EXP-3} compared to {EXP-2, EXP-4} clearly affected
the performance of ENT and SPL (see the first two rows in Figure 1(a)). This
motivates the use of a risk-optimizing strategy. Results of both experimental
sessions, (EXP-1, EXP-2) and (EXP-3, EXP-4), are summarized in Figures 2(a)
and 2(b), respectively. The figures show the (average) number of queries
asked by RIO and the (average) differences to the number 
of queries needed by the per-session better and worse strat- 
egy in {SPL,ENT}, respectively. The results illustrate clearly 
that the average performance achieved by RIO was always 
substantially closer to the better than to the worse strategy. 
In both EXP-1 and EXP-2, throughout 74% of 27 debugging 
sessions, RIO worked as efficiently as the best strategy (Fig- 
ure 1(a)) . In 26% of the cases in EXP-2, RIO even outper- 
formed both other strategies; in these cases, RIO could save 
more than 20% of user interaction on average compared to 
the best other strategy. In one scenario in EXP-1, it took ENT 
31 and SPL 13 queries to finish, whereas RIO required only 
6 queries, which amounts to an improvement of more than 
80% and 53%, respectively. In (EXP-3,EXP-4), the savings 
achieved by RIO were even more substantial. RIO manifested 
superior behavior to both other strategies in 29% and 71% 
of cases, respectively. Not less remarkable, in 100% of the 
tests in EXP-3 and EXP-4, RIO was at least as efficient as the best other
strategy. Recalling Figure 2(c), this means that RIO can avoid query
overheads of 2200%. Figure 1(b), which provides average values for q, react
and debug per strategy, demonstrates that RIO is the best choice in all
experiments w.r.t. q. Consequently, RIO is suitable for both good and poor
meta information. As to time aspects, RIO manifested good performance, too.
Since times consumed in (EXP-1, EXP-2)
are almost negligible, consider the more meaningful results 
obtained in (EXP-3,EXP-4). While the best reaction time in 
both experiments was achieved by SPL, we can clearly see 
that SPL was significantly inferior to both ENT and RIO con- 
cerning q and debug. RIO revealed the best debugging time in 
EXP-4, and needed only 2.2% more time than the best strat- 
egy (ENT) in EXP-3. However, if we assume the user be- 
ing capable of reading and answering a query in, e.g., 30 sec 
on average, which is already quite fast, then the overall time 
savings of RIO compared to ENT in EXP-3 would already ac- 
count for 5%. Doing the same thought experiment for EXP-4, 
RIO would save 25% (w.r.t. ENT) and 50% (w.r.t. SPL) of 
debugging time on average. All in all, the measured times 
confirm that RIO is well suited for interactive debugging. 
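
The percentages quoted in the observations follow from two simple ratios, sketched below. The function names are illustrative; the query counts 31, 13 and 6 are the ones reported for the EXP-1 scenario above, while the pair (24, 1) in the last line is a hypothetical session chosen only to show what an overhead of 2300% looks like.

```python
def saving(baseline_queries, rio_queries):
    """Relative reduction (in %) of user interaction compared to a baseline."""
    return (baseline_queries - rio_queries) / baseline_queries * 100


def overhead(q_worse, q_better):
    """Overhead (in %) of the worse strategy relative to the better one."""
    return (q_worse - q_better) / q_better * 100


# EXP-1 scenario from the text: ENT needed 31 queries, SPL 13, RIO 6.
print(round(saving(31, 6), 1), round(saving(13, 6), 1))
# Hypothetical session: 24 queries versus 1 query.
print(overhead(24, 1))
```

The first line evaluates to roughly 80.6 and 53.8, matching the reported improvements of more than 80% and 53%.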

5 Conclusions 

We have shown problems of state-of-the-art interactive ontol- 
ogy debugging strategies w.r.t. the usage of unreliable meta 
information. To tackle this issue, we proposed a learning 
strategy which combines the benefits of existing approaches, 
i.e. high potential and low risk. Depending on the perfor- 
mance of the diagnosis discrimination actions, the trust in the 
a-priori information is adapted. Tested under various con- 
ditions, our algorithm revealed good scalability and reaction 
time as well as superior average performance to two common 
approaches in the field w.r.t. required user interaction. 